Re-evaluating evaluation

Balduzzi, David, Tuyls, Karl, Perolat, Julien, Graepel, Thore

Neural Information Processing Systems

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) in including all available tasks and agents.
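
To make the agent-vs-agent case concrete, here is a minimal sketch (not code from the paper; the function name nash_average and the toy matrix are illustrative assumptions): given an antisymmetric payoff matrix A, where A[i, j] is the average advantage of agent i over agent j, Nash averaging treats A as a two-player zero-sum meta-game, computes a Nash equilibrium mixture over agents, and rates each agent by its expected payoff against that mixture.

```python
import numpy as np
from scipy.optimize import linprog

def nash_average(A):
    """Nash averaging sketch for the agent-vs-agent case.

    A is an antisymmetric payoff matrix: A[i, j] > 0 means agent i beats
    agent j on average. Returns a Nash equilibrium mixture p over agents
    and each agent's expected payoff against that mixture. Caveat: this LP
    returns *a* Nash equilibrium; the paper selects the maximum-entropy
    one, an extra step omitted here.
    """
    n = A.shape[0]
    # Variables x = (p_1, ..., p_n, v): maximise the game value v subject
    # to (p^T A)_j >= v for every column j, with p a probability vector.
    c = np.zeros(n + 1)
    c[-1] = -1.0                                  # linprog minimises, so minimise -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - (p^T A)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)  # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    p = res.x[:n]
    return p, A @ p

# Toy meta-game: agents 0 and 1 are redundant copies; agent 2 beats both.
A = np.array([[0.0, 0.0, -1.0],
              [0.0, 0.0, -1.0],
              [1.0, 1.0,  0.0]])
p, ratings = nash_average(A)
print(p)        # all mass on agent 2; duplicating agent 0 changes nothing
print(ratings)  # [-1. -1.  0.]
```

The toy example shows the adaptivity-to-redundancy property claimed in the abstract: because agents 0 and 1 are copies, adding or removing one of them does not shift any Nash mass or rating.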


Reviews: Re-evaluating evaluation

Neural Information Processing Systems

This paper applies game theory (Nash equilibria, though two-player zero-sum games suffice for most of the arguments) to the problem of evaluating a set of agents on a slate of tasks (and/or against other agents). The evaluation methods presented have the important property of being robust to the non-systematic addition of more agents and tasks. The paper casts the problem in a sound mathematical framework, leveraging the Hodge decomposition of skew-symmetric matrices and its generalization in combinatorial Hodge theory. This allows for a very illuminating and general view of the structure of evaluation, leading to a generalization of the Elo rating that eschews the transitivity assumption by embedding ratings in more dimensions, as is required in general, and to a "Nash averaging" method that I think will be the more lasting contribution. The paper's position on evaluation is also increasingly important at present, in my opinion. There are strong connections to the dueling variant of the contextual bandit setting, in which feedback is limited to preferences among pairs of actions that the learner chooses on the fly ([1] and related work).
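
The first step of the decomposition the review refers to is easy to demonstrate. The sketch below (the helper hodge_split is a hypothetical name, not code from the paper) splits an antisymmetric evaluation matrix into a transitive component generated by one scalar rating per agent, which is exactly what an Elo-style rating can represent, and a divergence-free cyclic remainder that no one-dimensional rating can capture.

```python
import numpy as np

def hodge_split(A):
    """Split an antisymmetric evaluation matrix into a transitive part,
    grad(r)[i, j] = r[i] - r[j], generated by scalar ratings r (the
    Elo-like component), and a divergence-free cyclic remainder."""
    r = A.mean(axis=1)                    # rating = average payoff of each agent
    transitive = r[:, None] - r[None, :]  # grad(r)
    cyclic = A - transitive               # rows of this average to zero
    return r, transitive, cyclic

# Rock-paper-scissors is purely cyclic: one-dimensional ratings see nothing.
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
r, T, C = hodge_split(A)
print(r)                  # [0. 0. 0.]
print(np.allclose(C, A))  # True: the whole game lives in the cyclic component
```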

